Performance Prediction and Evaluation of Parallel Processing on a NUMA Multiprocessor

Authors

  • Xiaodong Zhang
  • Xiaohan Qin
Abstract

Non-Uniform Memory Access (NUMA) architectures make it possible to build large-scale shared memory multiprocessor systems in comparison with non-scalable Uniform Memory Access (UMA) architectures. Most NUMA multiprocessor operations such as scheduling and synchronizing processes, accessing data from processors to memory models and allocating distributed memory space to different processors, are performed through interconnection networks such as a multistage switching network. The efficiency of these basic operations determines the parallel processing performance on a NUMA multiprocessor. This paper presents several analytical models to predict and evaluate the overhead of interprocessor communication, process scheduling, process synchronization and remote memory access where network contention and memory contention are considered. Performance measurements to support the models and analyses through several numerical examples have been done on the BBN GP1000, a NUMA shared memory multiprocessor. Both analytical and experimental results give a comprehensive and clear understanding of the various effects, which are important for the effective use of a NUMA shared memory multiprocessor. The results in this paper may be used to determine optimal strategies in developing an efficient programming environment for a NUMA system.

Index Terms: barrier, interprocessor communication, interconnection network, NUMA architectures, pre-scheduling, self-scheduling, remote memory access, shared memory, UMA architectures.

This research has been supported in part by the National Science Foundation under grants CCR-9008991 and CCR-9102854, and a grant from the San Antonio Area Foundation, SAAF-26-7400-03.

Similar resources

Parallel Implementation of Particle Swarm Optimization Variants Using Graphics Processing Unit Platform

There are different variants of Particle Swarm Optimization (PSO) algorithm such as Adaptive Particle Swarm Optimization (APSO) and Particle Swarm Optimization with an Aging Leader and Challengers (ALC-PSO). These algorithms improve the performance of PSO in terms of finding the best solution and accelerating the convergence speed. However, these algorithms are computationally intensive. The go...

Shared Memory Multiprocessor Architectures for Software IP Routers

In this paper, we propose new shared memory multiprocessor architectures and evaluate their performance for future Internet Protocol (IP) routers based on Symmetric Multi-Processor (SMP) and Cache Coherent Non-Uniform Memory Access (CC-NUMA) paradigms. We also propose a benchmark application suite, RouterBench, which consists of four categories of applications representing key functions on the ...

Towards Efficient Locality Aware Parallel Data Stream Processing

Parallel data processing and parallel streaming systems become quite popular. They are employed in various domains such as real-time signal processing, OLAP database systems, or high performance data extraction. One of the key components of these systems is the task scheduler which plans and executes tasks spawned by the application on available CPU cores. The multiprocessor systems and CPU arc...

Performance Evaluation of Macroblock-level Parallelization of H.264 Decoding on a cc-NUMA Multiprocessor Architecture

This paper presents a study of the performance scalability of a macroblock-level parallelization of the H.264 decoder for High Definition (HD) applications on a multiprocessor architecture. We have implemented this parallelization on a cache coherent Non-uniform Memory Access (cc-NUMA) shared memory multiprocessor (SMP) and compared the results with the theoretical expectations. The study inclu...

Comparative Modeling and Evaluation of CC-NUMA and COMA on Hierarchical Ring Architectures

Parallel computing performance on scalable shared-memory architectures is affected by the structure of the interconnection networks linking processors to memory modules and on the efficiency of the memory/cache management systems. Cache Coherence Non-Uniform Memory Access (CC-NUMA) and Cache Only Memory Access (COMA) are two effective memory systems, and the hierarchical ring structure is an efficien...

Journal title:
  • IEEE Trans. Software Eng.

Volume 17  Issue 

Pages  -

Publication date 1991